[AMD] feat: MiniMax M3 day-zero benchmark for MI325X by cquil11 · Pull Request #1748 · SemiAnalysisAI/InferenceX

cquil11 · 2026-06-13T23:48:29Z

Summary

add minimaxm3-fp8-mi325x-vllm for MiniMax M3 MXFP8 on MI325X
use vllm/vllm-openai-rocm:minimax-m3 and the official MI325X MXFP8 recipe shape
mirror the H200 non-MTP search space: TP4/TP8 latency, TP4/TP8 expert parallelism, and TP8 data-parallel attention across 1k1k and 8k1k
route Hugging Face cache to node-local /local-nvme/hf-hub-cache/ and runtime compiler caches to container-local /tmp
disable prefix caching for random-dataset benchmarks
mount /dev/kfd and /dev/dri explicitly for ROCm
use the default BF16 KV cache because FP8 KV corrupts MiniMax M3 generation on this MI325X/gfx942 image

Recipe Alignment

model: MiniMaxAI/MiniMax-M3-MXFP8
image: vllm/vllm-openai-rocm:minimax-m3
--block-size 128
--attention-backend TRITON_ATTN
--language-model-only
--no-enable-prefix-caching
MiniMax M3 tool/reasoning parsers with automatic tool choice
no MI355X-specific --enforce-eager workaround

Upstream reference: https://recipes.vllm.ai/MiniMaxAI/MiniMax-M3?hardware=mi325x&variant=mxfp8
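
For orientation, the recipe flags listed above can be collected into a single command line. This is a minimal sketch, not the PR's runner script: build_serve_cmd and its tensor-parallel argument are hypothetical names, and the function prints the command instead of launching a server. It deliberately adds neither --kv-cache-dtype nor --enforce-eager, matching the description.

```shell
# Sketch only: build_serve_cmd is a hypothetical helper, not part of the PR.
# It prints the vllm serve command implied by the recipe list; it does not run it.
build_serve_cmd() {
    tp="${1:-4}"   # tensor-parallel size; 4 and 8 appear in the search space
    printf 'vllm serve %s' MiniMaxAI/MiniMax-M3-MXFP8
    printf ' %s' \
        --tensor-parallel-size "$tp" \
        --block-size 128 \
        --attention-backend TRITON_ATTN \
        --language-model-only \
        --no-enable-prefix-caching \
        --tool-call-parser minimax_m3 \
        --reasoning-parser minimax_m3 \
        --enable-auto-tool-choice
    printf '\n'
}

# Show the TP4 variant.
build_serve_cmd 4
```

Printing rather than executing keeps the sketch safe to run on a machine without GPUs or the container image.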

Validation

Representative throughput smoke: https://github.com/SemiAnalysisAI/InferenceX/actions/runs/27482912444

vLLM started successfully on MI325X with the PR command
model downloaded through node-local /local-nvme/hf-hub-cache
all checkpoint shards loaded; CUDA graph capture completed
40 random 1k1k requests at TP4 / EP1 / concurrency 4 completed per runner
result processing, result upload, server-log upload, GPU-metrics upload, aggregation, and success-rate calculation succeeded
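
A smoke point like the one above could be driven by a client command along the following lines. This is a sketch under several assumptions: it uses vLLM's own `vllm bench serve` client, which may not be what the InferenceX harness runs; it reads "1k" as 1024 tokens; and it derives the request count as ten times the concurrency, which is only inferred from 40 requests at concurrency 4. bench_cmd is a hypothetical helper that prints the command without running it.

```shell
# Sketch only: bench_cmd is a hypothetical helper; the harness's real client may differ.
bench_cmd() {
    conc="${1:-4}"           # client concurrency
    isl="${2:-1024}"         # assumption: "1k" input means 1024 tokens
    osl="${3:-1024}"         # assumption: "1k" output means 1024 tokens
    prompts=$((conc * 10))   # assumption inferred from 40 requests at concurrency 4
    printf 'vllm bench serve --model %s --dataset-name random' MiniMaxAI/MiniMax-M3-MXFP8
    printf ' --random-input-len %s --random-output-len %s' "$isl" "$osl"
    printf ' --num-prompts %s --max-concurrency %s\n' "$prompts" "$conc"
}

# Print the command for the 1k1k, concurrency-4 smoke point.
bench_cmd 4
```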

Targeted DPA accuracy validation: https://github.com/SemiAnalysisAI/InferenceX/actions/runs/27484953170

exact failed full-sweep point: TP1 x DP8 + EP, concurrency 512, 8k1k eval-only
server startup and all 1,319 GSM8K requests completed
GSM8K strict exact match: 0.9575
GSM8K flexible exact match: 0.9568
score validation, server-log upload, GPU-metrics upload, eval artifact upload, aggregation, and success-rate calculation succeeded
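
The score-validation step can be pictured as a simple threshold comparison. The sketch below is illustrative only: score_ok is a hypothetical helper, and the 0.90 default threshold is an assumption, not the value the repository uses.

```shell
# Sketch only: score_ok and the 0.90 default threshold are illustrative assumptions.
# Exits 0 when the score meets the threshold, non-zero otherwise.
score_ok() {
    awk -v s="$1" -v t="${2:-0.90}" 'BEGIN { exit !(s + 0 >= t + 0) }'
}

# The two GSM8K scores reported above clear an assumed 0.90 bar.
score_ok 0.9575 && echo "strict exact match passes"
score_ok 0.9568 && echo "flexible exact match passes"
```

awk is used because POSIX shell arithmetic is integer-only and cannot compare fractional scores directly.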

Failure diagnosis: with --kv-cache-dtype fp8, generation was deterministically corrupted (repetitive and cross-prompt output) and GSM8K accuracy fell to 1-2%. On the same node, image, weights, and layouts, removing only the FP8 KV setting restored correct generation both with and without CUDA graphs. The PR therefore leaves the KV cache at vLLM's default dtype.
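
An A/B comparison of this kind is easier to repeat when the KV cache flag is the only thing that varies between runs. A minimal sketch, assuming a hypothetical KV_CACHE_DTYPE variable and kv_cache_args helper (neither appears in the PR): when the variable is unset or "auto", no flag is emitted and vLLM keeps its default.

```shell
# Sketch only: KV_CACHE_DTYPE and kv_cache_args are hypothetical names.
# Emits the extra serve arguments for the requested KV cache dtype, or nothing.
kv_cache_args() {
    case "${KV_CACHE_DTYPE:-auto}" in
        auto) ;;   # emit nothing so vLLM uses its default KV cache dtype
        *)    printf '%s %s\n' --kv-cache-dtype "$KV_CACHE_DTYPE" ;;
    esac
}

# Compare what each arm of the A/B run would append to the serve command.
echo "default arm adds: '$(kv_cache_args)'"
echo "fp8 arm adds:     '$(KV_CACHE_DTYPE=fp8; kv_cache_args)'"
```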

Additional validation:

shell syntax, YAML parsing, matrix generation, and git diff --check pass
matrix matches the H200-aligned 31-point search space
/enroot resolves to local NVMe on every healthy compute node
XDG_CACHE_HOME and TRITON_CACHE_DIR use per-job local paths, avoiding stale NFS compiler artifacts

Full PR sweep: https://github.com/SemiAnalysisAI/InferenceX/actions/runs/27485135330

Changelog Integrity

perf-changelog.yaml is current main byte-for-byte followed only by this PR's entry at the tail.

Note

Low Risk
Benchmark and CI runner configuration only; MI325X launcher changes affect cache paths and GPU device visibility but are scoped to the AMD Slurm launch path.

Overview
Adds MiniMax-M3 MXFP8 single-node vLLM benchmarking on MI325X via a new minimaxm3-fp8-mi325x-vllm matrix entry, a minimaxm3_fp8_mi325x.sh runner aligned to the official MI325X recipe (vllm/vllm-openai-rocm:minimax-m3, block size 128, TRITON_ATTN, MiniMax parsers, default BF16 KV), and an H200-style search space (TP4/TP8, EP, TP8 DPA) for 1k1k and 8k1k.

launch_mi325x-amds.sh is updated for all MI325X jobs: the Hugging Face hub cache moves from NFS to /local-nvme/hf-hub-cache/, XDG_CACHE_HOME and TRITON_CACHE_DIR point to per-job directories under /tmp, and /dev/kfd and /dev/dri are mounted explicitly for ROCm in the container.
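
The launcher changes described here amount to a few environment variables and a device-mount list. The following is a sketch, not the contents of launch_mi325x-amds.sh: setup_job_caches and mount_spec are hypothetical helpers, HF_HUB_CACHE is assumed to be the variable used for the hub cache, the per-job directory names are invented for illustration, and SLURM_JOB_ID is assumed to be available. How the mount list is handed to the container runtime is launcher-specific and not shown in this thread.

```shell
# Sketch only: helper names and directory naming are assumptions, not the PR's code.
setup_job_caches() {
    job_id="${SLURM_JOB_ID:-manual}"                  # assumption: Slurm sets a job id
    export HF_HUB_CACHE="/local-nvme/hf-hub-cache/"   # assumption: variable name; path is from the PR
    export XDG_CACHE_HOME="/tmp/xdg-cache-${job_id}"  # per-job, container-local
    export TRITON_CACHE_DIR="/tmp/triton-cache-${job_id}"
}

# Emit SRC:DST pairs for the ROCm device nodes named in the PR.
mount_spec() {
    printf '%s:%s,%s:%s\n' /dev/kfd /dev/kfd /dev/dri /dev/dri
}

setup_job_caches
echo "HF_HUB_CACHE=$HF_HUB_CACHE"
echo "XDG_CACHE_HOME=$XDG_CACHE_HOME"
echo "TRITON_CACHE_DIR=$TRITON_CACHE_DIR"
echo "mounts=$(mount_spec)"
```

Keying the compiler caches on the job id is one way to get the per-job isolation the description mentions, so that one job never reads another job's compiled artifacts.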

Reviewed by Cursor Bugbot for commit 6abc71f. Bugbot is set up for automated code reviews on this repo.

github-actions · 2026-06-13T23:48:43Z

Thanks for the contribution! For vLLM & SGLang, please ensure that your recipe is similar to the official vLLM recipes and/or the SGLang cookbook.

If it is not, please create a PR first before we can merge your single node PR into the master branch. Let's ensure that the documentation is first class such that the entire ML community can benefit from your hard work! Thank you

PR authors are responsible for ensuring that after merging, all GitHub Action jobs fully pass. A lot of the time, failures are just flakes and simply re-running the failed jobs will fix it. If re-running failed jobs is attempted, PR authors are responsible for ensuring it passes. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow

As a rule of thumb, PR authors should request a review and get a PR approval from the respective companies' CODEOWNERS before requesting a review from core maintainers.

If additional help is needed, PR authors can reach out to core maintainers over Slack.

claude · 2026-06-13T23:49:04Z

Claude finished @cquil11's task in 2m 46s.

Review: PR #1748 — MiniMax M3 MI325X day-zero benchmark

Read changed files and gather context
Validate master config and perf-changelog
Validate benchmark script code style and patterns
Validate launch script changes
Post review findings

LGTM — no blocking issues found.

Validation summary:

Config + perf-changelog both updated ✅
Perf-changelog entry appended at end of file ✅
Image vllm/vllm-openai-rocm:minimax-m3 is publicly accessible (Docker Hub) and matches MI355X ✅
Launch script has enroot import docker:// pattern for reproducibility ✅
Expert parallelism is correctly conditional on EP_SIZE / DP_ATTENTION ✅
Server launch command is properly formatted with separate lines ✅
MI325X search space is an exact match of the H200 non-MTP search space (verified both 1k1k and 8k1k) ✅
Benchmark script is functionally identical to the MI355X variant (minimaxm3_fp8_mi355x.sh), differing only in the header comment ✅

Note (non-blocking): The launch_mi325x-amds.sh changes (HF cache path to /local-nvme/hf-hub-cache/ and /dev/kfd + /dev/dri mounts) apply to all MI325X benchmarks, not just MiniMax M3. The device mounts are additive, so they shouldn't break existing jobs, and moving to node-local NVMe is a reasonable infra choice for faster model loading. Just noting the scope.

claude · 2026-06-13T23:52:29Z

Claude finished @cquil11's task in 1m 10s.

Review: PR #1748 — MiniMax M3 MI325X day-zero benchmark

Read changed files and gather context
Validate master config and perf-changelog
Validate container image accessibility
Validate benchmark script (code style, EP pattern, server formatting)
Validate launch script (enroot import, device mounts)
Validate model prefix
Post review findings

LGTM — no blocking issues found.

Validation summary:

Master config + perf-changelog: Both updated, perf-changelog entry correctly appended at EOF ✅
Model prefix: minimaxm3 matches existing entries across AMD and NVIDIA configs ✅
Container image: vllm/vllm-openai-rocm:minimax-m3 on Docker Hub — publicly accessible ✅
Enroot import: launch_mi325x-amds.sh has the enroot import docker:// pattern ✅
Expert parallelism: Correctly conditional on EP_SIZE / DP_ATTENTION with the standard pattern ✅
Server command formatting: Arguments on separate lines with \ continuations ✅
H200 search space alignment: MI325X search space is an exact match of the H200 non-MTP search space (both 1k1k and 8k1k) ✅
MI355X vs MI325X diff: Only differences are the header comment and removal of --enforce-eager (intentional per PR description — MI355X-specific workaround) ✅
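
The conditional parallelism setup the review refers to is not quoted in this thread. One plausible reading is sketched below: EP_SIZE and DP_ATTENTION are the variable names from the review, while TP, GPUS, parallel_args, and the exact conditions are assumptions. With data-parallel attention enabled, the sketch uses TP1 with one data-parallel rank per GPU, which matches the "TP1 x DP8 + EP" point mentioned in the PR description.

```shell
# Sketch only: one possible reading of the pattern; not the script under review.
# TP, GPUS, and parallel_args are hypothetical names.
parallel_args() {
    gpus="${GPUS:-8}"
    if [ "${DP_ATTENTION:-false}" = "true" ]; then
        # Data-parallel attention: TP1 with one DP rank per GPU.
        args="--tensor-parallel-size 1 --data-parallel-size $gpus"
    else
        args="--tensor-parallel-size ${TP:-$gpus}"
    fi
    if [ "${EP_SIZE:-1}" -gt 1 ]; then
        # Expert parallelism is only requested when EP_SIZE exceeds 1.
        args="$args --enable-expert-parallel"
    fi
    printf '%s\n' "$args"
}

# Three representative points: plain TP4, TP8 with EP, and DP attention with EP.
(TP=4; EP_SIZE=1; DP_ATTENTION=false; parallel_args)
(TP=8; EP_SIZE=8; DP_ATTENTION=false; parallel_args)
(GPUS=8; EP_SIZE=8; DP_ATTENTION=true; parallel_args)
```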

Note (non-blocking): The launch_mi325x-amds.sh changes (HF cache path to /local-nvme/hf-hub-cache/ and /dev/kfd + /dev/dri mounts) apply to all MI325X benchmarks, not just MiniMax M3. The device mounts are additive and the NVMe cache path is a reasonable infra choice. Just noting the blast radius.

github-actions · 2026-06-14T00:12:29Z

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=27483150384
see unofficial run visualizer at https://inferencex.semianalysis.com/evaluation?unofficialRun=27483150384

github-actions · 2026-06-14T00:30:38Z

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=27483150384
see unofficial run visualizer at https://inferencex.semianalysis.com/evaluation?unofficialRun=27483150384

github-actions · 2026-06-14T01:28:16Z

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=27483489611
see unofficial run visualizer at https://inferencex.semianalysis.com/evaluation?unofficialRun=27483489611

cursor

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Reviewed by Cursor Bugbot for commit 5ec3e11.

cursor · 2026-06-14T01:50:03Z

+    --no-enable-prefix-caching \
+    --tool-call-parser minimax_m3 \
+    --reasoning-parser minimax_m3 \
+    --enable-auto-tool-choice > "$SERVER_LOG" 2>&1 &


Missing FP8 KV cache flag

Medium Severity

The new MI325X vllm serve invocation omits --kv-cache-dtype fp8 even though the PR recipe alignment, changelog, and the existing minimaxm3_fp8_mi355x.sh baseline all specify FP8 KV cache. Without it, vLLM may use a non-FP8 KV default, skewing memory headroom and throughput versus the official MI325X MXFP8 recipe and other MiniMax M3 entries.

github-actions · 2026-06-14T03:32:51Z

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=27485135330
see unofficial run visualizer at https://inferencex.semianalysis.com/evaluation?unofficialRun=27485135330

github-actions · 2026-06-14T03:47:41Z

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=27487160980
see unofficial run visualizer at https://inferencex.semianalysis.com/evaluation?unofficialRun=27487160980

github-actions · 2026-06-14T04:51:40Z

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=27487160980
see unofficial run visualizer at https://inferencex.semianalysis.com/evaluation?unofficialRun=27487160980

github-actions · 2026-06-14T04:55:23Z

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=27487160980
see unofficial run visualizer at https://inferencex.semianalysis.com/evaluation?unofficialRun=27487160980

github-actions · 2026-06-14T05:23:30Z

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=27489021974
see unofficial run visualizer at https://inferencex.semianalysis.com/evaluation?unofficialRun=27489021974

Rebased onto main: MiniMax-M3 MXFP8 MI325X day-zero recipe (script + amd-master entry + perf-changelog + mi325x launcher tuning), plus VLLM_USE_BREAKABLE_CUDAGRAPH=0 so the recipe runs with CUDA graphs.

Consolidated the branch's commits onto current main (which now carries the mi300x non-MTP/MTP recipes) to resolve the amd-master/changelog EOF-append conflicts.

Co-Authored-By: functionstackx <47992694+functionstackx@users.noreply.github.com>
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

github-actions · 2026-06-14T07:27:28Z

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=27487160980
see unofficial run visualizer at https://inferencex.semianalysis.com/evaluation?unofficialRun=27487160980

github-actions · 2026-06-14T09:02:02Z

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=27491836349
see unofficial run visualizer at https://inferencex.semianalysis.com/evaluation?unofficialRun=27491836349

github-project-automation Bot added this to InferenceMAX Board Jun 13, 2026

cquil11 marked this pull request as ready for review June 13, 2026 23:48

cquil11 requested a review from a team June 13, 2026 23:48

cquil11 requested review from 1am9trash, billishyahao, chunfangamd, seungrokj and yctseng0211 as code owners June 13, 2026 23:48

cquil11 marked this pull request as draft June 13, 2026 23:48

cquil11 marked this pull request as ready for review June 13, 2026 23:49

cquil11 marked this pull request as draft June 13, 2026 23:50

cquil11 marked this pull request as ready for review June 13, 2026 23:51

cquil11 added the full-sweep-fail-fast label Jun 14, 2026

cursor Bot reviewed Jun 14, 2026

functionstackx force-pushed the codex/minimaxm3-mi325x-dayzero branch from f78392f to 6abc71f Compare June 14, 2026 07:24
